This project analyzes historical data on herbicide production and market values from 2000 to 2023 and forecasts sales for the next five years (2024-2028). Using Python, we process and analyze the dataset, identify trends, and implement a suitable forecasting method. The analysis considers factors such as area harvested, yield, and changes in the consumer price index to provide accurate predictions of the future market value of herbicides.
Data Loading and Preprocessing: Import the dataset into Python and perform necessary data cleaning and preprocessing steps to ensure the data is ready for analysis.
Exploratory Data Analysis (EDA): Conduct a thorough analysis of the data to identify trends, patterns, and correlations that may be useful for forecasting.
Model Selection and Implementation: Choose an appropriate forecasting model based on the insights gained from the EDA and implement it using Python.
Sales Forecast Generation: Generate sales forecasts for the years 2024-2028 using the selected model.
Model Evaluation and Presentation: Evaluate the accuracy of the forecasts, keep the code clear and well organized, and document each step with explanatory comments.
By following these steps, we aim to deliver a robust and accurate forecast of herbicide sales, supported by clear and well-documented code.
#Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.holtwinters import ExponentialSmoothing
import warnings
# Suppress warnings for cleaner output
warnings.filterwarnings("ignore")
The libraries used throughout this notebook are as follows:
pandas: Provides data structures and data analysis tools for handling and manipulating structured data.
numpy: Supports large, multi-dimensional arrays and matrices, along with mathematical functions to operate on them.
matplotlib.pyplot: A plotting library used for creating static, animated, and interactive visualizations.
seaborn: A data visualization library based on matplotlib for drawing attractive and informative statistical graphics.
sklearn.model_selection.train_test_split: Splits datasets into training and testing sets for model evaluation.
sklearn.metrics.mean_squared_error: Evaluates regression model performance by calculating the mean squared error.
statsmodels.tsa.holtwinters.ExponentialSmoothing: Implements the Holt-Winters Exponential Smoothing method for time series forecasting.
warnings: Manages the display of warning messages to ensure cleaner output during code execution.
variance_inflation_factor from statsmodels.stats.outliers_influence: Calculates the Variance Inflation Factor (VIF) to detect multicollinearity in a set of predictor variables.
train_test_split from sklearn.model_selection: Splits arrays or matrices into random train and test subsets for model evaluation.
StandardScaler from sklearn.preprocessing: Standardizes features by removing the mean and scaling to unit variance.
LinearRegression from sklearn.linear_model: Implements ordinary least squares linear regression.
GradientBoostingRegressor from sklearn.ensemble: Implements gradient boosting for regression problems.
SVR from sklearn.svm: Implements support vector regression for predicting continuous outputs.
xgb (XGBoost): Provides an optimized gradient boosting library designed for speed and performance.
MLPRegressor from sklearn.neural_network: Implements a multi-layer perceptron regressor for modeling complex relationships.
mean_squared_error, r2_score, mean_absolute_error from sklearn.metrics: Provides metrics to evaluate the performance of regression models.
plot_acf from statsmodels.graphics.tsaplots: Plots the autocorrelation function (ACF) for time series analysis.
boxcox from scipy.stats: Applies the Box-Cox transformation to make data more normally distributed.
Holt from statsmodels.tsa.holtwinters: Implements Holt’s linear trend method for forecasting.
go from plotly.graph_objects: Provides a high-level interface for creating interactive visualizations in Plotly.
This data dictionary provides a description of each column in the dataset and their corresponding attributes.
| Column Name | Description |
|---|---|
| Year | The year of the recorded data |
| Market_Value | Market value in USD |
| Area_Harvested | Area harvested in thousands of hectares (000 Ha) |
| Yield | Yield in tons per hectare (Tons/Ha) |
| CPI | Consumer Price Index, Year-on-Year percent change |
# Load the data from the Excel files
market_value_data = pd.read_excel('Agrochemical Market Value Data.xlsx')
variables_data = pd.read_excel('Variables Data.xlsx')
# Save the data to CSV files
market_value_data.to_csv('Agrochemical_Market_Value_Data.csv', index=False)
variables_data.to_csv('Variables_Data.csv', index=False)
print("Data has been successfully converted to CSV format.")
Data has been successfully converted to CSV format.
We converted the data from Excel to CSV format to simplify data handling, improve performance, ensure compatibility with various tools, and facilitate easier sharing and inspection. CSV files are lightweight, faster to read/write, and widely supported, making them ideal for data analysis and forecasting tasks.
# Load the data from the CSV files
market_value_data = pd.read_csv('Agrochemical_Market_Value_Data.csv')
variables_data = pd.read_csv('Variables_Data.csv')
# Display the first few rows of each DataFrame to verify the data
print("Market Value Data:")
print(market_value_data.head())
print("\nVariables Data:")
print(variables_data.head())
Market Value Data:
   Year     Product Market Value (USD)
0  2000  Herbicides   103.298833767801
1  2001  Herbicides   114.172395217043
2  2002  Herbicides    90.613012077019
3  2003  Herbicides   108.735614492422
4  2004  Herbicides   117.796915700124

Variables Data:
   Year     Product  Area Harvested (000 Ha)  Yield (Tons/Ha)  \
0  2000  Herbicides                   3100.0              5.5
1  2001  Herbicides                   2816.0              5.5
2  2002  Herbicides                   2420.0              6.1
3  2003  Herbicides                   2450.0              6.3
4  2004  Herbicides                   2339.0              6.4

   Consumer Price Index, Year-on-Year Percent Change
0                                               -0.9
1                                               -1.1
2                                               25.9
3                                               13.4
4                                                4.4
# Merge the two DataFrames on the 'Year' and 'Product' columns
merged_data = pd.merge(market_value_data, variables_data, on=['Year', 'Product'])
# Display the first few rows of the merged DataFrame to verify the merge
print("\nMerged Data:")
print(merged_data.head())
Merged Data:
   Year     Product Market Value (USD)  Area Harvested (000 Ha)  \
0  2000  Herbicides   103.298833767801                   3100.0
1  2001  Herbicides   114.172395217043                   2816.0
2  2002  Herbicides    90.613012077019                   2420.0
3  2003  Herbicides   108.735614492422                   2450.0
4  2004  Herbicides   117.796915700124                   2339.0

   Yield (Tons/Ha)  Consumer Price Index, Year-on-Year Percent Change
0              5.5                                               -0.9
1              5.5                                               -1.1
2              6.1                                               25.9
3              6.3                                               13.4
4              6.4                                                4.4
Checking and updating datatype: Ensures data integrity and compatibility by converting columns to appropriate types, thereby enabling accurate data analysis, efficient storage, and proper execution of operations and algorithms.
# Check data types in the Market Value Data
print("\nData Types in Merged Data:")
print(merged_data.dtypes)
Data Types in Merged Data:
Year                                                   int64
Product                                               object
Market Value (USD)                                    object
Area Harvested (000 Ha)                              float64
Yield (Tons/Ha)                                      float64
Consumer Price Index, Year-on-Year Percent Change    float64
dtype: object
# Drop the 'Product' column as it is redundant
merged_data.drop(columns=['Product'], inplace=True)
# Convert 'Market Value (USD)' to float; the '?' placeholders in the
# forecast years (2024-2028) are set to 0 so the column can be cast
merged_data['Market Value (USD)'] = merged_data['Market Value (USD)'].replace('?', 0).astype(float)
# Display the first few rows of the DataFrame after type conversion
print("\nData after type conversion:")
print(merged_data.head())
# Check data types to confirm the conversion
print("\nData Types in Merged Data after conversion:")
print(merged_data.dtypes)
Data after type conversion:
   Year  Market Value (USD)  Area Harvested (000 Ha)  Yield (Tons/Ha)  \
0  2000          103.298834                   3100.0              5.5
1  2001          114.172395                   2816.0              5.5
2  2002           90.613012                   2420.0              6.1
3  2003          108.735614                   2450.0              6.3
4  2004          117.796916                   2339.0              6.4

   Consumer Price Index, Year-on-Year Percent Change
0                                               -0.9
1                                               -1.1
2                                               25.9
3                                               13.4
4                                                4.4

Data Types in Merged Data after conversion:
Year                                                   int64
Market Value (USD)                                   float64
Area Harvested (000 Ha)                              float64
Yield (Tons/Ha)                                      float64
Consumer Price Index, Year-on-Year Percent Change    float64
dtype: object
We deleted the redundant column 'Product' because it contains the same value ('Herbicides') for all rows, providing no additional information for the forecasting analysis.
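The redundancy can also be verified programmatically before dropping. A minimal sketch on a hypothetical three-row frame (not the project data), removing any column with a single unique value:

```python
import pandas as pd

# Toy frame mirroring the structure of merged_data (hypothetical values)
df = pd.DataFrame({
    'Year': [2000, 2001, 2002],
    'Product': ['Herbicides'] * 3,
    'Market Value (USD)': [103.3, 114.2, 90.6],
})

# A column with one unique value carries no information for modeling
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) == 1]
df = df.drop(columns=constant_cols)

print(constant_cols)      # ['Product']
print(list(df.columns))   # ['Year', 'Market Value (USD)']
```

This check generalizes the manual observation: any future constant column would be caught the same way.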
This code will rename the columns as follows:
'Year' -> 'Year'
'Market Value (USD)' -> 'Market_Value'
'Area Harvested (000 Ha)' -> 'Area_Harvested'
'Yield (Tons/Ha)' -> 'Yield'
'Consumer Price Index, Year-on-Year Percent Change' -> 'CPI'

These new column names are simpler and free of extra symbols or characters, making them easier to reference in code.
# Renaming columns
merged_data.columns = ['Year', 'Market_Value', 'Area_Harvested', 'Yield', 'CPI']
print(merged_data)
    Year  Market_Value  Area_Harvested  Yield    CPI
0   2000    103.298834     3100.000000    5.5   -0.9
1   2001    114.172395     2816.000000    5.5   -1.1
2   2002     90.613012     2420.000000    6.1   25.9
3   2003    108.735614     2450.000000    6.3   13.4
4   2004    117.796916     2339.000000    6.4    4.4
5   2005    132.294998     2783.000000    7.4    9.6
6   2006     94.237533     2440.000000    6.5   10.9
7   2007    117.796916     2800.000000    8.0    8.8
8   2008    168.540202     3412.000000    6.5    8.6
9   2009    126.858217     2500.000000    6.2    6.3
10  2010    154.042121     3000.000000    8.3   10.5
11  2011    199.348627     3750.000000    6.7    9.8
12  2012    302.647460     3600.000000    5.8   10.0
13  2013    537.428899     4000.000000    6.8   10.6
14  2014    540.053552     3400.000000    7.6   21.4
15  2015    457.233259     3500.000000    8.5   16.4
16  2016    498.371566     3700.000000    8.0   37.5
17  2017    509.245128     4900.000000    8.4   25.4
18  2018    525.555470     5200.000000    6.2   34.2
19  2019    572.674236     6100.000000    8.4   52.8
20  2020    612.169011     6300.000000    8.1   40.5
21  2021    625.000000     6550.000000    7.9   47.1
22  2022    900.000000     7100.000000    7.0   73.1
23  2023    682.838834     6750.000000    5.3  134.0
24  2024      0.000000     6800.000000    8.1  272.2
25  2025      0.000000     7200.000000    7.6   70.9
26  2026      0.000000     7132.197788    7.7   21.2
27  2027      0.000000     7221.251348    7.7   11.8
28  2028      0.000000     7244.102280    7.8    5.0
This code will filter the merged_data dataframe to only include rows where the Year is less than or equal to 2023, and then calculate the summary statistics for the filtered data. The summary statistics will include mean, variance, median, range, and other standard measures.
We include data up to the year 2023 in our analysis to study historical trends and inform predictive modeling for subsequent years using the specified variables.
# Filter data to only include years up to 2023
# (.copy() avoids chained-assignment warnings when columns are added later)
filtered_data = merged_data[merged_data['Year'] <= 2023].copy()
# Calculate summary statistics
summary_stats = filtered_data.describe().transpose()
# Calculate additional statistics
summary_stats['variance'] = filtered_data.var()
summary_stats['median'] = filtered_data.median()
summary_stats['range'] = filtered_data.max() - filtered_data.min()
# Display the summary statistics
print(summary_stats)
count mean std min 25% \
Year 24.0 2011.500000 7.071068 2000.000000 2005.750000
Market_Value 24.0 345.456367 244.623899 90.613012 117.796916
Area_Harvested 24.0 3954.583333 1550.849779 2339.000000 2795.750000
Yield 24.0 6.975000 1.038079 5.300000 6.200000
CPI 24.0 25.383333 29.578939 -1.100000 9.400000
50% 75% max variance median \
Year 2011.500000 2017.250000 2023.0 5.000000e+01 2011.500000
Market_Value 250.998043 538.085062 900.0 5.984085e+04 250.998043
Area_Harvested 3456.000000 4975.000000 7100.0 2.405135e+06 3456.000000
Yield 6.750000 8.000000 8.5 1.077609e+00 6.750000
CPI 12.150000 35.025000 134.0 8.749136e+02 12.150000
range
Year 23.000000
Market_Value 809.386988
Area_Harvested 4761.000000
Yield 3.200000
CPI 135.100000
# Function to find the year and value corresponding to the min and max values
def find_min_max_years_values(df, column):
    min_row = df.loc[df[column] == df[column].min()]
    max_row = df.loc[df[column] == df[column].max()]
    min_year = min_row['Year'].values[0]
    max_year = max_row['Year'].values[0]
    min_value = min_row[column].values[0]
    max_value = max_row[column].values[0]
    return min_year, min_value, max_year, max_value
# Dictionary to hold the results for specific columns
specific_results = {}
# Columns to analyze
columns_to_analyze = ['Area_Harvested', 'Market_Value', 'Yield']
# Find years and values for each specified column
for column in columns_to_analyze:
    min_year, min_value, max_year, max_value = find_min_max_years_values(filtered_data, column)
    specific_results[column] = {
        'Min Year': min_year,
        'Min Value': min_value,
        'Max Year': max_year,
        'Max Value': max_value
    }
# Display the results
for column, details in specific_results.items():
    print(f"{column}: {details}")
Area_Harvested: {'Min Year': 2004, 'Min Value': 2339.0, 'Max Year': 2022, 'Max Value': 7100.0}
Market_Value: {'Min Year': 2002, 'Min Value': 90.613012077019, 'Max Year': 2022, 'Max Value': 900.0}
Yield: {'Min Year': 2023, 'Min Value': 5.3, 'Max Year': 2015, 'Max Value': 8.5}
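As a side note, pandas' `idxmin`/`idxmax` offer a more concise route to the same min/max-year lookup. A sketch on a hypothetical mini-frame (values chosen to echo the results above):

```python
import pandas as pd

# Hypothetical mini-frame standing in for filtered_data
df = pd.DataFrame({
    'Year':  [2004, 2015, 2022, 2023],
    'Yield': [6.4,  8.5,  7.0,  5.3],
})

# idxmin/idxmax return the row label of the extreme value,
# which .loc then maps back to the corresponding Year
min_year = df.loc[df['Yield'].idxmin(), 'Year']
max_year = df.loc[df['Yield'].idxmax(), 'Year']
print(min_year, max_year)  # 2023 2015
```

Unlike the equality-filter approach, this returns a single row even when the extreme value is tied.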
The dataset covers the period from 2000 to 2023, with a focus on forecasting market values for the subsequent five years. The analysis considers historical data until 2023 to make predictions.
Market Value:
Area Harvested:
Yield:
CPI Year-on-Year Percent Change:
This dataset reveals significant variability in market values and harvested areas, with notable peaks observed in 2022. Yield data shows more consistency, with the lowest yield noted in 2023. These trends and statistics are essential for forecasting future market values and understanding agricultural trends.
# List of features to create box plots for
features = ['Market_Value', 'Area_Harvested', 'Yield', 'CPI']
# Create box plots
plt.figure(figsize=(14, 10))
for i, feature in enumerate(features, 1):
    plt.subplot(2, 2, i)
    sns.boxplot(y=filtered_data[feature])
    plt.title(f'Box Plot of {feature}')
    plt.xlabel('')
plt.tight_layout()
plt.show()
# List of features to plot against Year
features = ['Market_Value', 'Area_Harvested', 'Yield', 'CPI']
# Plot each feature against Year
plt.figure(figsize=(12, 8))
for i, feature in enumerate(features, 1):
    plt.subplot(2, 2, i)
    plt.plot(filtered_data['Year'], filtered_data[feature], marker='o')
    plt.title(f'Year vs {feature}')
    plt.xlabel('Year')
    plt.ylabel(feature)
    plt.grid(True)
plt.tight_layout()
plt.show()
We have created separate line charts for each of the four variables against the year: Area Harvested, Yield, Consumer Price Index (CPI) Year-on-Year Percent Change, and Market Value.
The visualizations indicate a general growth in agricultural activities and market values for herbicides, despite notable variability in yield and CPI changes. The significant peak in market value in 2022 suggests a potential anomaly or an impactful market event. These insights are crucial for understanding historical trends and informing future market predictions.
# Calculate the correlation matrix
correlation_matrix = filtered_data[['Market_Value', 'Area_Harvested', 'Yield', 'CPI']].corr()
# Print correlation values
print("Correlation Matrix:")
print(correlation_matrix)
# Plot the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()
Correlation Matrix:
Market_Value Area_Harvested Yield CPI
Market_Value 1.000000 0.898081 0.331799 0.732717
Area_Harvested 0.898081 1.000000 0.227519 0.797040
Yield 0.331799 0.227519 1.000000 -0.009147
CPI 0.732717 0.797040 -0.009147 1.000000
The correlation matrix provides insights into the relationships between different variables in the dataset up to the year 2023. Here is a summary of the key findings:
Market Value:
Insights:
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Calculate VIF
X = filtered_data[['Area_Harvested', 'Yield', 'CPI']]
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(len(X.columns))]
print(vif_data)
          feature        VIF
0  Area_Harvested  25.005225
1           Yield  13.495178
2             CPI   5.415824
The VIF values indicate significant multicollinearity among the variables: Area_Harvested (25.0) and Yield (13.5) exceed the common threshold of 10, while CPI (5.4) shows only moderate inflation.
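For reference, VIF for feature j is 1 / (1 − R²_j), where R²_j comes from regressing feature j on the remaining features. A self-contained sketch on synthetic data (note: this version fits an intercept in each auxiliary regression, so its numbers can differ from `variance_inflation_factor` applied to the raw columns without a constant, as above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)                   # independent of the other two
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF_j = 1 / (1 - R^2) from regressing column j on the remaining columns."""
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
# The collinear pair gets large VIFs; the independent column stays near 1
```

This makes explicit why dropping one member of a collinear pair (as done next) shrinks the remaining VIFs.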
# Remove the variable with the highest VIF (Area_Harvested)
X = filtered_data[['Yield', 'CPI']]
# Recalculate VIF
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(len(X.columns))]
print(vif_data)
  feature       VIF
0   Yield  1.736739
1     CPI  1.736739
# Alternative combination: drop CPI instead, keeping Yield and Area_Harvested
X_df = filtered_data[['Yield', 'Area_Harvested']]
# Recalculate VIF
vif_data = pd.DataFrame()
vif_data["feature"] = X_df.columns
vif_data["VIF"] = [variance_inflation_factor(X_df.values, i) for i in range(len(X_df.columns))]
print(vif_data)
          feature       VIF
0           Yield  8.018639
1  Area_Harvested  8.018639
We replace outliers with the mean of neighboring values to ensure data consistency and accuracy, thereby mitigating the impact of anomalous values on analysis and model performance.
We create new columns Market_Value_New and CPI_New to retain the original data while keeping a cleaned version ready for analysis.
# Function to replace outliers with mean of neighboring values
def replace_outliers(df, year_col, value_col, outlier_years):
    new_col = f'{value_col}_New'
    df[new_col] = df[value_col]  # Create the new column
    for year in outlier_years:
        if year in df[year_col].values:
            idx = df.index[df[year_col] == year][0]
            if idx > 0 and idx < len(df) - 1:
                # Calculate mean of neighboring values
                mean_value = (df.at[idx - 1, value_col] + df.at[idx + 1, value_col]) / 2
                df.at[idx, new_col] = mean_value
            elif idx == 0:
                # If the outlier is at the first row, only take the next value
                df.at[idx, new_col] = df.at[idx + 1, value_col]
            elif idx == len(df) - 1:
                # If the outlier is at the last row, only take the previous value
                df.at[idx, new_col] = df.at[idx - 1, value_col]
    return df
# Outlier years identified from visualization
outlier_years_market_value = [2013, 2022]
outlier_years_cpi = [2002, 2023]
# Replace outliers in 'Market_Value' column
filtered_data = replace_outliers(filtered_data, 'Year', 'Market_Value', outlier_years_market_value)
# Replace outliers in 'CPI' column
filtered_data = replace_outliers(filtered_data, 'Year', 'CPI', outlier_years_cpi)
# Display the modified DataFrame
print(filtered_data)
Year Market_Value Area_Harvested Yield CPI Market_Value_New \
0 2000 103.298834 3100.0 5.5 -0.9 103.298834
1 2001 114.172395 2816.0 5.5 -1.1 114.172395
2 2002 90.613012 2420.0 6.1 25.9 90.613012
3 2003 108.735614 2450.0 6.3 13.4 108.735614
4 2004 117.796916 2339.0 6.4 4.4 117.796916
5 2005 132.294998 2783.0 7.4 9.6 132.294998
6 2006 94.237533 2440.0 6.5 10.9 94.237533
7 2007 117.796916 2800.0 8.0 8.8 117.796916
8 2008 168.540202 3412.0 6.5 8.6 168.540202
9 2009 126.858217 2500.0 6.2 6.3 126.858217
10 2010 154.042121 3000.0 8.3 10.5 154.042121
11 2011 199.348627 3750.0 6.7 9.8 199.348627
12 2012 302.647460 3600.0 5.8 10.0 302.647460
13 2013 537.428899 4000.0 6.8 10.6 421.350506
14 2014 540.053552 3400.0 7.6 21.4 540.053552
15 2015 457.233259 3500.0 8.5 16.4 457.233259
16 2016 498.371566 3700.0 8.0 37.5 498.371566
17 2017 509.245128 4900.0 8.4 25.4 509.245128
18 2018 525.555470 5200.0 6.2 34.2 525.555470
19 2019 572.674236 6100.0 8.4 52.8 572.674236
20 2020 612.169011 6300.0 8.1 40.5 612.169011
21 2021 625.000000 6550.0 7.9 47.1 625.000000
22 2022 900.000000 7100.0 7.0 73.1 653.919417
23 2023 682.838834 6750.0 5.3 134.0 682.838834
CPI_New
0 -0.90
1 -1.10
2 6.15
3 13.40
4 4.40
5 9.60
6 10.90
7 8.80
8 8.60
9 6.30
10 10.50
11 9.80
12 10.00
13 10.60
14 21.40
15 16.40
16 37.50
17 25.40
18 34.20
19 52.80
20 40.50
21 47.10
22 73.10
23 73.10
This step replaces outliers in 'Market_Value' for 2013 and 2022 with the mean of neighboring values to prevent skewing statistical analysis and improve model performance.
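The neighbor-mean rule can also be expressed compactly with pandas: for an isolated outlier, masking it and interpolating linearly is equivalent to averaging its two neighbors. A sketch on hypothetical values (rounded from the 2012-2014 figures above):

```python
import pandas as pd

s = pd.Series([302.6, 537.4, 540.1])          # hypothetical 2012-2014 values
outlier_mask = pd.Series([False, True, False])  # 2013 flagged as an outlier

# Masking the outlier to NaN and interpolating linearly reproduces the
# mean-of-neighbors replacement for a single-point gap; limit_direction
# handles outliers at either end of the series
cleaned = s.mask(outlier_mask).interpolate(limit_direction='both')
# middle value becomes (302.6 + 540.1) / 2
```

This one-liner behaves like `replace_outliers` for isolated outliers, though the explicit function makes the edge-case handling easier to audit.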
The rationale for using ex post forecasting in this analysis is to validate model accuracy by comparing predicted values against known outcomes, ensuring robustness in temporal prediction.
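The ex post idea — train on earlier years and test on recent years whose actual values are already known — comes down to a chronological rather than random split. A minimal sketch with placeholder values (the 70/30 proportion matches the split used later in this notebook):

```python
import pandas as pd

# Hypothetical yearly series standing in for filtered_data
df = pd.DataFrame({'Year': range(2000, 2024),
                   'Market_Value': range(24)})  # placeholder values

# Chronological (not random) split: the test set is the most recent 30%,
# whose actual outcomes are known — enabling ex post validation
split_point = int(len(df) * 0.7)               # 16 of 24 rows
train, test = df.iloc[:split_point], df.iloc[split_point:]

print(train['Year'].max(), test['Year'].min())  # 2015 2016
```

Because the test years follow the training years, this mimics a genuine forecasting scenario rather than interpolation.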
In the regression analysis, various combinations of variables will be utilized to determine the optimal model. The following combinations have been selected based on Variance Inflation Factor (VIF) scores to address multicollinearity:
Combination 1: [['Yield', 'CPI']]
Combination 2: [['Yield', 'Area_Harvested']]

The regression models will be applied to each of these combinations to assess their performance and identify the most suitable model for forecasting.
Model Building: Combination 1 [['Yield', 'CPI']]

The code in the following cells performs these steps for training and evaluating multiple regression models on the filtered_data dataset:

Import Libraries: train_test_split and StandardScaler from sklearn.model_selection and sklearn.preprocessing for data splitting and scaling; the regression models from sklearn.linear_model, sklearn.ensemble, sklearn.svm, xgboost, and sklearn.neural_network; and the evaluation metrics from sklearn.metrics.

Why Standardize: Features (Yield and CPI_New) can have different scales (units and ranges). Standardizing ensures that all features contribute equally to the model's performance, improving the convergence speed and accuracy of algorithms sensitive to the scale of input data, such as Support Vector Regression, Gradient Boosting, and Neural Networks.

Prepare Features and Target: Use Yield and CPI_New as features (X) and Market_Value_New as the target variable (y).

Split Data: Use train_test_split to split the data randomly with an 80/20 split.

Standardize Features: Use StandardScaler so the features have a mean of 0 and a standard deviation of 1 for both training and test sets.

Initialize Models: Build a dictionary models containing five different regression models: Linear Regression, Gradient Boosting, Support Vector Regression, XGBoost, and Neural Network.

Fit Models and Evaluate: Train each model and report its predictions and error metrics.

This code essentially prepares the data, standardizes it, trains multiple regression models, and evaluates their performance using various metrics.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Prepare features and target
X = filtered_data[['Yield', 'CPI_New']]
y = filtered_data['Market_Value_New']
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
!pip install xgboost
Requirement already satisfied: xgboost in c:\users\admin\anaconda3\lib\site-packages (2.1.0) Requirement already satisfied: numpy in c:\users\admin\anaconda3\lib\site-packages (from xgboost) (1.22.4) Requirement already satisfied: scipy in c:\users\admin\anaconda3\lib\site-packages (from xgboost) (1.7.3)
#import necessary libraries
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
import xgboost as xgb
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
# Prepare features and target
X = filtered_data[['Yield', 'CPI_New']]
y = filtered_data['Market_Value_New']
# Determine the split point (chronological 70/30 split, so the test set
# is the most recent years — the ex post evaluation described above)
split_point = int(len(filtered_data) * 0.7)
# Split data into training and test sets
X_train = X[:split_point]
X_test = X[split_point:]
y_train = y[:split_point]
y_test = y[split_point:]
# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Initialize models
models = {
    "Linear Regression": LinearRegression(),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
    "Support Vector Regression": SVR(),
    "XGBoost": xgb.XGBRegressor(random_state=0),
    "Neural Network": MLPRegressor(random_state=0, max_iter=1000)
}
# Fit models and evaluate
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    print(f"{name}")
    print("Actual Values:   ", y_test.values)
    print("Predicted Values:", y_pred)
    print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
    print("Root Mean Squared Error:", np.sqrt(mean_squared_error(y_test, y_pred)))
    print("Mean Absolute Error:", mean_absolute_error(y_test, y_pred))
    print("R^2 Score:", r2_score(y_test, y_pred))
    print("-" * 30)
Linear Regression
Actual Values:    [498.37156642 509.24512787 525.55547005 572.67423633 612.16901125 625.         653.9194169  682.83883379]
Predicted Values: [ 737.14440826  500.74361833  691.63282666 1027.06802931  793.77795443  922.54325546 1430.91410567 1447.79874628]
Mean Squared Error: 200192.29554520088
Root Mean Squared Error: 447.4285368024718
Mean Absolute Error: 361.1065376112989
R^2 Score: -46.67996649335153
------------------------------
Gradient Boosting
Actual Values:    [498.37156642 509.24512787 525.55547005 572.67423633 612.16901125 625.         653.9194169  682.83883379]
Predicted Values: [537.57974213 507.57743104 514.39334225 507.57743104 507.23704783 537.57974213 544.49214159 537.91445722]
Mean Squared Error: 7191.5839187379215
Root Mean Squared Error: 84.80320700738812
Mean Absolute Error: 70.47983484938479
R^2 Score: -0.7128255577754168
------------------------------
Support Vector Regression
Actual Values:    [498.37156642 509.24512787 525.55547005 572.67423633 612.16901125 625.         653.9194169  682.83883379]
Predicted Values: [129.57188841 130.35771788 129.57908887 129.55986204 129.56176223 129.55987609 129.55986199 129.55986199]
Mean Squared Error: 211519.7189672591
Root Mean Squared Error: 459.91272972952066
Mean Absolute Error: 455.3079678871648
R^2 Score: -49.37782840531405
------------------------------
XGBoost
Actual Values:    [498.37156642 509.24512787 525.55547005 572.67423633 612.16901125 625.         653.9194169  682.83883379]
Predicted Values: [540.0518  540.0518  540.05133 540.0518  540.0518  540.0523  540.0523  540.05133]
Mean Squared Error: 6216.454684064235
Root Mean Squared Error: 78.84449685339005
Mean Absolute Error: 66.66558995724151
R^2 Score: -0.4805782122454232
------------------------------
Neural Network
Actual Values:    [498.37156642 509.24512787 525.55547005 572.67423633 612.16901125 625.         653.9194169  682.83883379]
Predicted Values: [ 583.53528018  437.81269472  401.78014816  835.55253396  634.56740648  715.37562494 1025.29641547  898.47570389]
Mean Squared Error: 36233.79460432775
Root Mean Squared Error: 190.3517654352797
Mean Absolute Error: 155.37970691101145
R^2 Score: -7.629833170933656
------------------------------
| Model | Mean Squared Error (MSE) | Root Mean Squared Error (RMSE) | Mean Absolute Error (MAE) | R² Score |
|---|---|---|---|---|
| Linear Regression | 200192.30 | 447.43 | 361.11 | -46.6800 |
| Gradient Boosting | 7191.58 | 84.80 | 70.48 | -0.7128 |
| Support Vector Regression | 211519.72 | 459.91 | 455.31 | -49.3778 |
| XGBoost | 6216.45 | 78.84 | 66.67 | -0.4806 |
| Neural Network | 36233.79 | 190.35 | 155.38 | -7.6298 |
Linear Regression:
Gradient Boosting:
Support Vector Regression:
XGBoost:
Neural Network:
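All five R² scores above are negative. R² falls below zero whenever a model accumulates more squared error on the test set than simply predicting the test-set mean would; a flat prediction against a rising series reproduces this (hypothetical numbers):

```python
from sklearn.metrics import r2_score

y_true = [500, 525, 575, 625, 675]   # steadily rising actuals (hypothetical)
y_pred = [540, 540, 540, 540, 540]   # a flat prediction, as XGBoost produced above

# R^2 = 1 - SSE_model / SSE_mean; the flat guess misses the trend badly
# at both ends, so SSE_model exceeds SSE_mean and R^2 goes negative
print(r2_score(y_true, y_pred))      # ≈ -0.39
```

This is why a low RMSE alone (e.g., XGBoost's 78.84) does not imply a useful model when the test set has a strong trend.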
Prediction: Combination 1 [['Yield', 'CPI']]

# filtering the data required for prediction
# (.copy() avoids chained-assignment warnings when the forecast column is added)
pred_data_1 = merged_data.iloc[24:][['Year', 'Yield', 'CPI']].copy()
print(pred_data_1)
    Year  Yield    CPI
24  2024    8.1  272.2
25  2025    7.6   70.9
26  2026    7.7   21.2
27  2027    7.7   11.8
28  2028    7.8    5.0
# Scaling the data (the scaler was fit on ['Yield', 'CPI_New'],
# so align the column names before transforming)
X_pred_scaled_1 = scaler.transform(pred_data_1[['Yield', 'CPI']].rename(columns={'CPI': 'CPI_New'}))
# Access the trained XGBoost model
xgb_model = models["XGBoost"]
# Predict new values using the trained model
y_predict_1 = xgb_model.predict(X_pred_scaled_1)
pred_data_1['Market_Value'] = y_predict_1
print(pred_data_1)
    Year  Yield    CPI  Market_Value
24  2024    8.1  272.2    540.051819
25  2025    7.6   70.9    540.052307
26  2026    7.7   21.2    446.820984
27  2027    7.7   11.8    132.953583
28  2028    7.8    5.0    111.102394
Between 2024 and 2028, market value declines significantly despite stable yields. In 2024, a high CPI of 272.2 corresponds with a market value of USD 540.05 million. By 2025, although the CPI drops to 70.9, the market value remains stable at USD 540.05 million. In 2026, with a CPI of 21.2, the market value decreases to USD 446.82 million. The trend continues in 2027 with the market value dropping sharply to USD 132.95 million as the CPI falls to 11.8. By 2028, the market value reaches USD 111.10 million with a low CPI of 5.0. The decreasing CPI, indicating deflation, appears to significantly impact market values.
Model Building: Combination 2 [['Yield', 'Area_Harvested']]

# Prepare features and target
X_1 = filtered_data[['Yield', 'Area_Harvested']]
y_1 = filtered_data['Market_Value_New']
# Determine the split point
split_point = int(len(filtered_data) * 0.7)
# Split data into training and test sets
X_train_1 = X_1[:split_point]
X_test_1 = X_1[split_point:]
y_train_1 = y_1[:split_point]
y_test_1 = y_1[split_point:]
# Standardize the features
scaler = StandardScaler()
X_train_scaled_1 = scaler.fit_transform(X_train_1)
X_test_scaled_1 = scaler.transform(X_test_1)
# Initialize models
models = {
    "Linear Regression": LinearRegression(),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
    "Support Vector Regression": SVR(),
    "XGBoost": xgb.XGBRegressor(random_state=0),
    "Neural Network": MLPRegressor(random_state=0, max_iter=1000)
}
# Fit models and evaluate
for name, model in models.items():
    model.fit(X_train_scaled_1, y_train_1)
    y_pred_1 = model.predict(X_test_scaled_1)
    print(f"{name}")
    print("Actual Values:   ", y_test_1.values)
    print("Predicted Values:", y_pred_1)
    print("Mean Squared Error:", mean_squared_error(y_test_1, y_pred_1))
    print("Root Mean Squared Error:", np.sqrt(mean_squared_error(y_test_1, y_pred_1)))
    print("Mean Absolute Error:", mean_absolute_error(y_test_1, y_pred_1))
    print("R^2 Score:", r2_score(y_test_1, y_pred_1))
    print("-" * 30)
Linear Regression
Actual Values:    [498.37156642 509.24512787 525.55547005 572.67423633 612.16901125 625.         653.9194169  682.83883379]
Predicted Values: [381.10274737 614.26813431 568.9890942  829.42128006 851.77095679 887.58840823 945.67272402 806.36769811]
Mean Squared Error: 39916.05861992978
Root Mean Squared Error: 199.79003633797603
Mean Absolute Error: 179.99312732449633
R^2 Score: -8.506841071789347
------------------------------
Gradient Boosting
Actual Values:    [498.37156642 509.24512787 525.55547005 572.67423633 612.16901125 625.         653.9194169  682.83883379]
Predicted Values: [456.18455633 431.57089064 199.99754744 431.57089064 428.6522523  430.57524259 421.29913656 301.86127223]
Mean Squared Error: 50555.82880401083
Root Mean Squared Error: 224.84623368873855
Mean Absolute Error: 197.25773423525322
R^2 Score: -11.040924036832337
------------------------------
Support Vector Regression
Actual Values:    [498.37156642 509.24512787 525.55547005 572.67423633 612.16901125 625.         653.9194169  682.83883379]
Predicted Values: [135.05359887 132.32778516 132.30816009 132.2267862  132.22674094 132.22672353 132.22672088 132.22672107]
Mean Squared Error: 208901.38556081773
Root Mean Squared Error: 457.0573110243591
Mean Absolute Error: 452.36880323404046
R^2 Score: -48.75421774763337
------------------------------
XGBoost
Actual Values:    [498.37156642 509.24512787 525.55547005 572.67423633 612.16901125 625.         653.9194169  682.83883379]
Predicted Values: [457.23843 448.26352 182.2107  448.26352 448.26352 448.26517 421.3505  283.40018]
Mean Squared Error: 51314.26362951942
Root Mean Squared Error: 226.5265186010667
Mean Absolute Error: 192.814767365053
R^2 Score: -11.221561093663905
------------------------------
Neural Network
Actual Values:    [498.37156642 509.24512787 525.55547005 572.67423633 612.16901125 625.         653.9194169  682.83883379]
Predicted Values: [312.34901483 566.74702553 457.331885   790.98686945 805.79393293 837.45998432 872.51691429 679.21682571]
Mean Squared Error: 27581.695164641125
Root Mean Squared Error: 166.07737704046608
Mean Absolute Error: 144.79563486210975
R^2 Score: -5.569155409794445
------------------------------
| Model | Mean Squared Error (MSE) | Root Mean Squared Error (RMSE) | Mean Absolute Error (MAE) | R² Score |
|---|---|---|---|---|
| Linear Regression | 39916.06 | 199.79 | 179.99 | -8.5068 |
| Gradient Boosting | 50555.83 | 224.85 | 197.26 | -11.0409 |
| Support Vector Regression | 208901.39 | 457.06 | 452.37 | -48.7542 |
| XGBoost | 51314.26 | 226.53 | 192.81 | -11.2216 |
| Neural Network | 27581.70 | 166.08 | 144.80 | -5.5692 |
Forecast with the Neural Network model (features [['Yield', 'Area_Harvested']])
# filtering the data required for prediction
pred_data_2 = merged_data.iloc[24:][['Year','Yield','Area_Harvested']]
print(pred_data_2)
    Year  Yield  Area_Harvested
24  2024    8.1     6800.000000
25  2025    7.6     7200.000000
26  2026    7.7     7132.197788
27  2027    7.7     7221.251348
28  2028    7.8     7244.102280
# Scaling the data
X_pred_scaled_2 = scaler.transform((pred_data_2[['Yield','Area_Harvested']]))
# Access the trained Neural Network model
nn_model = models["Neural Network"]
# Predict new values using the trained model
y_predict_2 = nn_model.predict(X_pred_scaled_2)
pred_data_2['Predicted Market Value (USD)'] = y_predict_2
print(pred_data_2)
    Year  Yield  Area_Harvested  Predicted Market Value (USD)
24  2024    8.1     6800.000000                    899.215950
25  2025    7.6     7200.000000                    936.340831
26  2026    7.7     7132.197788                    931.194935
27  2027    7.7     7221.251348                    947.834032
28  2028    7.8     7244.102280                    959.626104
Between 2024 and 2028, the trained Neural Network model predicts a steady increase in market value, reaching USD 959.63 million by 2028. Yield remains relatively stable, fluctuating between 7.6 and 8.1 tons/ha, while the area harvested grows from 6800 to 7244 hectares. This suggests that despite minor variations in yield and harvested area, the market value is expected to rise consistently over the period.
In response to the poor ex post forecasting results from the regression models, random sampling was applied to strengthen model validation by ensuring that the training and test sets are representative of the entire dataset.
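The difference between an ex post (chronological) split and random sampling can be sketched with a small index-based illustration; this is a simplified picture of the idea, not the internals of `train_test_split`:

```python
import numpy as np

def chronological_split(n_rows, test_frac):
    # Ex post evaluation: the most recent rows form the test set
    cut = int(n_rows * (1 - test_frac))
    indices = np.arange(n_rows)
    return indices[:cut], indices[cut:]

def random_split(n_rows, test_frac, seed=0):
    # Random sampling: test rows are drawn from the whole period,
    # so both sets see early and late market regimes
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    cut = int(n_rows * (1 - test_frac))
    return np.sort(shuffled[:cut]), np.sort(shuffled[cut:])
```

With 24 annual observations and a 20% test fraction, the chronological split evaluates only on the five most recent years, whereas random sampling mixes early and late years into both sets.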
Random sampling with features [['Yield', 'CPI']]
import joblib
import xgboost as xgb
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_error, r2_score
np.random.seed(0)
# Prepare features and target
X_2 = filtered_data[['Yield', 'CPI_New']]
y_2 = filtered_data['Market_Value_New']
# Split data into training and test sets
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X_2, y_2, test_size=0.2, random_state=0)
# Standardize the features
scaler = StandardScaler()
X_train_scaled_2 = scaler.fit_transform(X_train_2)
X_test_scaled_2 = scaler.transform(X_test_2)
# Initialize models
models = {
    "Linear Regression": LinearRegression(),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
    "Support Vector Regression": SVR(),
    "XGBoost": xgb.XGBRegressor(random_state=0),
    "Neural Network": MLPRegressor(random_state=0, max_iter=1000)
}
# Fit models and evaluate
for name, model in models.items():
    model.fit(X_train_scaled_2, y_train_2)
    y_pred_2 = model.predict(X_test_scaled_2)
    print(f"{name}")
    print("Actual Values: ", y_test_2.values)
    print("Predicted Values:", y_pred_2)
    print("Mean Squared Error:", mean_squared_error(y_test_2, y_pred_2))
    print("Root Mean Squared Error:", np.sqrt(mean_squared_error(y_test_2, y_pred_2)))
    print("Mean Absolute Error:", mean_absolute_error(y_test_2, y_pred_2))
    print("R^2 Score:", r2_score(y_test_2, y_pred_2))
    print("-" * 30)
Linear Regression
Actual Values:  [199.34862657 154.04212053 653.9194169 540.05355198 612.16901125]
Predicted Values: [211.79944641 276.5885912 783.59804879 347.52637127 535.05353473]
Mean Squared Error: 15000.543997917877
Root Mean Squared Error: 122.47670798122343
Mean Absolute Error: 106.86371592745158
R^2 Score: 0.6663162535794338
------------------------------
Gradient Boosting
Actual Values:  [199.34862657 154.04212053 653.9194169 540.05355198 612.16901125]
Predicted Values: [150.7034757 401.49489273 662.06813274 516.9090123 498.81220423]
Mean Squared Error: 15410.212431117656
Root Mean Squared Error: 124.13787669812004
Mean Absolute Error: 88.14959712064571
R^2 Score: 0.6572032709036508
------------------------------
Support Vector Regression
Actual Values:  [199.34862657 154.04212053 653.9194169 540.05355198 612.16901125]
Predicted Values: [168.97183059 173.1619798 173.89771811 173.14459834 176.25000889]
Mean Squared Error: 111271.34138684312
Root Mean Squared Error: 333.5735921604753
Mean Absolute Error: 266.4692620067271
R^2 Score: -1.4752061037491244
------------------------------
XGBoost
Actual Values:  [199.34862657 154.04212053 653.9194169 540.05355198 612.16901125]
Predicted Values: [163.1888 330.12357 625.01636 514.72876 498.37286]
Mean Squared Error: 9347.700745269864
Root Mean Squared Error: 96.6835081348927
Mean Absolute Error: 76.05305477570961
R^2 Score: 0.7920624874982588
------------------------------
Neural Network
Actual Values:  [199.34862657 154.04212053 653.9194169 540.05355198 612.16901125]
Predicted Values: [ 90.36357526 168.13778621 414.52421336 173.2148767 308.16612539]
Mean Squared Error: 59274.97219213003
Root Mean Squared Error: 243.46451937013333
Mean Absolute Error: 206.66349632978168
R^2 Score: -0.3185584997977562
------------------------------
The results of the model evaluations with random sampling are as follows:
| Model | Mean Squared Error (MSE) | Root Mean Squared Error (RMSE) | Mean Absolute Error (MAE) | R² Score |
|---|---|---|---|---|
| Linear Regression | 15000.54 | 122.48 | 106.86 | 0.6663 |
| Gradient Boosting | 15410.21 | 124.14 | 88.15 | 0.6572 |
| Support Vector Regression | 111271.34 | 333.57 | 266.47 | -1.4752 |
| XGBoost | 9347.70 | 96.68 | 76.05 | 0.7921 |
| Neural Network | 59274.97 | 243.46 | 206.66 | -0.3186 |
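As a sanity check, the headline XGBoost figures in the table can be reproduced with NumPy alone from the actual and predicted test values printed above (values copied from that output):

```python
import numpy as np

# Actual and predicted test values printed above for XGBoost
actual = np.array([199.34862657, 154.04212053, 653.9194169, 540.05355198, 612.16901125])
predicted = np.array([163.1888, 330.12357, 625.01636, 514.72876, 498.37286])

errors = actual - predicted
mse = np.mean(errors ** 2)                # mean squared error
rmse = np.sqrt(mse)                       # root mean squared error
mae = np.mean(np.abs(errors))             # mean absolute error
# R^2 = 1 - SS_res / SS_tot
r2 = 1 - np.sum(errors ** 2) / np.sum((actual - actual.mean()) ** 2)
```

The results agree with the table (MSE ≈ 9347.70, RMSE ≈ 96.68, MAE ≈ 76.05, R² ≈ 0.7921) up to the rounding of the printed prediction values.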
Forecast with the XGBoost model (features [['Yield', 'CPI']], random sampling)
# filtering the data required for prediction
pred_data_3 = merged_data.iloc[24:][['Year','Yield','CPI']]
print(pred_data_3)
    Year  Yield    CPI
24  2024    8.1  272.2
25  2025    7.6   70.9
26  2026    7.7   21.2
27  2027    7.7   11.8
28  2028    7.8    5.0
# Scaling the data
X_pred_scaled_3 = scaler.transform((pred_data_3)[['Yield','CPI']])
# Access the trained XGBoost model
xgb_model = models["XGBoost"]
# Predict new values using the trained model
y_predict_3 = xgb_model.predict(X_pred_scaled_3)
pred_data_3['Market_Value'] = y_predict_3
print(pred_data_3)
    Year  Yield    CPI  Market_Value
24  2024    8.1  272.2    572.647339
25  2025    7.6   70.9    624.999634
26  2026    7.7   21.2    514.728760
27  2027    7.7   11.8    127.414009
28  2028    7.8    5.0    124.699387
# Select and filter the relevant columns from merged_data till year 2023
merged_data_filtered = merged_data[merged_data['Year'] <= 2023][['Year', 'Yield', 'Market_Value', 'CPI']]
# Select and rename the relevant columns from pred_data_3
pred_data_filtered = pred_data_3[['Year', 'Yield', 'Market_Value', 'CPI']]
# Concatenate the DataFrames
combined_data = pd.concat([merged_data_filtered, pred_data_filtered], ignore_index=True)
print(combined_data)
    Year  Yield  Market_Value    CPI
0   2000    5.5    103.298834   -0.9
1   2001    5.5    114.172395   -1.1
2   2002    6.1     90.613012   25.9
3   2003    6.3    108.735614   13.4
4   2004    6.4    117.796916    4.4
5   2005    7.4    132.294998    9.6
6   2006    6.5     94.237533   10.9
7   2007    8.0    117.796916    8.8
8   2008    6.5    168.540202    8.6
9   2009    6.2    126.858217    6.3
10  2010    8.3    154.042121   10.5
11  2011    6.7    199.348627    9.8
12  2012    5.8    302.647460   10.0
13  2013    6.8    537.428899   10.6
14  2014    7.6    540.053552   21.4
15  2015    8.5    457.233259   16.4
16  2016    8.0    498.371566   37.5
17  2017    8.4    509.245128   25.4
18  2018    6.2    525.555470   34.2
19  2019    8.4    572.674236   52.8
20  2020    8.1    612.169011   40.5
21  2021    7.9    625.000000   47.1
22  2022    7.0    900.000000   73.1
23  2023    5.3    682.838834  134.0
24  2024    8.1    572.647339  272.2
25  2025    7.6    624.999634   70.9
26  2026    7.7    514.728760   21.2
27  2027    7.7    127.414009   11.8
28  2028    7.8    124.699387    5.0
# Given error values
mse = 9347.700745269864
rmse = 96.6835081348927
mae = 76.05305477570961
r2 = 0.7920624874982588
# Calculate the standard error
std_error = np.sqrt(mse)
# Calculate the confidence intervals (95%)
z_value = 1.96 # for 95% confidence
combined_data['Lower_Bound'] = combined_data['Market_Value'] - z_value * std_error
combined_data['Upper_Bound'] = combined_data['Market_Value'] + z_value * std_error
# Display the prediction intervals for the predicted years (2024 onwards)
predicted_years = combined_data[combined_data['Year'] >= 2024]
print(predicted_years[['Year', 'Market_Value', 'Lower_Bound', 'Upper_Bound']])
# Plot the actual and predicted market values
plt.figure(figsize=(10, 6))
# Plot actual values
plt.plot(combined_data['Year'], combined_data['Market_Value'], label='Actual Market Value', marker='o', color='blue')
# Highlight the predicted values from 2024 onwards in orange
predicted_years = combined_data[combined_data['Year'] >= 2024]
plt.plot(predicted_years['Year'], predicted_years['Market_Value'], label='Predicted Market Value', marker='x', linestyle='--', color='orange')
# Plot the prediction intervals
plt.scatter(predicted_years['Year'], predicted_years['Lower_Bound'], color='blue', marker='*', label='Lower PI')
plt.scatter(predicted_years['Year'], predicted_years['Upper_Bound'], color='blue', marker='*', label='Upper PI')
plt.fill_between(predicted_years['Year'], predicted_years['Lower_Bound'], predicted_years['Upper_Bound'], color='orange', alpha=0.3, label='Prediction Interval')
plt.xlabel('Year')
plt.ylabel('Market Value (USD)')
plt.title('Actual vs Predicted Market Value with Prediction Intervals')
plt.legend()
plt.grid(True)
plt.show()
    Year  Market_Value  Lower_Bound  Upper_Bound
24  2024    572.647339   383.147663   762.147015
25  2025    624.999634   435.499958   814.499310
26  2026    514.728760   325.229084   704.228436
27  2027    127.414009   -62.085667   316.913685
28  2028    124.699387   -64.800289   314.199063
Between 2024 and 2028, the XGBoost model predicts a fluctuating market value, peaking at USD 625 million in 2025 before declining sharply to USD 125 million by 2028. Yield remains stable at 7.6 to 8.1 tons/ha, while the CPI input falls steeply from 272.2 to 5.0, a strongly disinflationary path. Despite consistent yields, the declining CPI appears to drive the substantial drop in predicted market value towards the end of the period.
The predicted market values for 2024 to 2028 carry wide prediction intervals. In 2024, the market value is predicted to be approximately USD 572.65, with a 95% interval ranging from USD 383.15 to USD 762.15; for 2025, the predicted value of around USD 625.00 spans USD 435.50 to USD 814.50. Because each interval is a fixed ±1.96 × RMSE ≈ ±189.50 band around the point forecast, the intervals do not widen with the horizon; rather, the sharply lower forecasts in later years push them below zero. The 2027 prediction of USD 127.41 spans USD -62.09 to USD 316.91, and the 2028 value of USD 124.70 spans USD -64.80 to USD 314.20. Negative market values are economically implausible, which signals that these long-horizon point forecasts should be interpreted with caution.
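The printed bounds follow directly from the point forecast plus or minus 1.96 times the XGBoost test RMSE; for example, the 2024 bounds can be reproduced as follows (figures taken from the output above):

```python
rmse = 96.6835081348927        # XGBoost test RMSE from the evaluation above
point_forecast_2024 = 572.647339
z = 1.96                       # two-sided 95% normal quantile

lower = point_forecast_2024 - z * rmse   # lower bound of the 95% interval
upper = point_forecast_2024 + z * rmse   # upper bound of the 95% interval
```

Note that this construction yields the same interval width (about 379) at every horizon, since it is based on one-step test error rather than horizon-dependent forecast variance.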
Random sampling with features [['Yield', 'Area_Harvested']]
np.random.seed(0)
# Prepare features and target
X_3 = filtered_data[['Yield', 'Area_Harvested']]
y_3 = filtered_data['Market_Value_New']
# Split data into training and test sets
X_train_3, X_test_3, y_train_3, y_test_3 = train_test_split(X_3, y_3, test_size=0.2, random_state=0)
# Standardize the features
scaler = StandardScaler()
X_train_scaled_3 = scaler.fit_transform(X_train_3)
X_test_scaled_3 = scaler.transform(X_test_3)
# Initialize models
models = {
    "Linear Regression": LinearRegression(),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
    "Support Vector Regression": SVR(),
    "XGBoost": xgb.XGBRegressor(random_state=0),
    "Neural Network": MLPRegressor(random_state=0, max_iter=1000)
}
# Fit models and evaluate
for name, model in models.items():
    model.fit(X_train_scaled_3, y_train_3)
    y_pred_3 = model.predict(X_test_scaled_3)
    print(f"{name}")
    print("Actual Values: ", y_test_3.values)
    print("Predicted Values:", y_pred_3)
    print("Mean Squared Error:", mean_squared_error(y_test_3, y_pred_3))
    print("Root Mean Squared Error:", np.sqrt(mean_squared_error(y_test_3, y_pred_3)))
    print("Mean Absolute Error:", mean_absolute_error(y_test_3, y_pred_3))
    print("R^2 Score:", r2_score(y_test_3, y_pred_3))
    print("-" * 30)
Linear Regression
Actual Values:  [199.34862657 154.04212053 653.9194169 540.05355198 612.16901125]
Predicted Values: [298.40307779 257.07339817 747.39466084 284.55982473 681.77063886]
Mean Squared Error: 19857.25618819813
Root Mean Squared Error: 140.91577693146402
Mean Absolute Error: 124.13126553474847
R^2 Score: 0.5582797770913748
------------------------------
Gradient Boosting
Actual Values:  [199.34862657 154.04212053 653.9194169 540.05355198 612.16901125]
Predicted Values: [436.37243614 115.08548697 659.84278816 177.17403284 573.25046317]
Mean Squared Error: 38185.83814483376
Root Mean Squared Error: 195.41197032125172
Mean Absolute Error: 136.74037632243926
R^2 Score: 0.15056457058183204
------------------------------
Support Vector Regression
Actual Values:  [199.34862657 154.04212053 653.9194169 540.05355198 612.16901125]
Predicted Values: [170.07935442 171.90944137 173.45048622 171.17437544 174.69577081]
Mean Squared Error: 111896.20155355385
Root Mean Squared Error: 334.5088960753568
Mean Absolute Error: 266.7915881274097
R^2 Score: -1.489105978410067
------------------------------
XGBoost
Actual Values:  [199.34862657 154.04212053 653.9194169 540.05355198 612.16901125]
Predicted Values: [437.5803 118.23085 620.2428 119.82225 572.67896]
Mean Squared Error: 47464.939887327215
Root Mean Squared Error: 217.86449891464008
Mean Absolute Error: 153.48818246284506
R^2 Score: -0.055846972445047216
------------------------------
Neural Network
Actual Values:  [199.34862657 154.04212053 653.9194169 540.05355198 612.16901125]
Predicted Values: [129.7573991 173.93873192 447.76964348 152.29159375 441.44798985]
Mean Squared Error: 45448.30931272621
Root Mean Squared Error: 213.18609080501994
Mean Absolute Error: 170.8241183807458
R^2 Score: -0.010987476324604728
------------------------------
The results of the model evaluations with random sampling are as follows:
| Model | Mean Squared Error (MSE) | Root Mean Squared Error (RMSE) | Mean Absolute Error (MAE) | R² Score |
|---|---|---|---|---|
| Linear Regression | 19857.26 | 140.92 | 124.13 | 0.5583 |
| Gradient Boosting | 38185.84 | 195.41 | 136.74 | 0.1506 |
| Support Vector Regression | 111896.20 | 334.51 | 266.79 | -1.4891 |
| XGBoost | 47464.94 | 217.86 | 153.49 | -0.0558 |
| Neural Network | 45448.31 | 213.19 | 170.82 | -0.0110 |
# filtering the data required for prediction
pred_data_4 = merged_data.iloc[24:][['Year','Yield','Area_Harvested']]
print(pred_data_4)
    Year  Yield  Area_Harvested
24  2024    8.1     6800.000000
25  2025    7.6     7200.000000
26  2026    7.7     7132.197788
27  2027    7.7     7221.251348
28  2028    7.8     7244.102280
# Scaling the data
X_pred_scaled_4 = scaler.transform((pred_data_4[['Yield','Area_Harvested']]))
# Access the trained Linear Regression model
lr_model = models["Linear Regression"]
# Predict new values using the trained model
y_predict_4 = lr_model.predict(X_pred_scaled_4)
pred_data_4['Market_Value'] = y_predict_4
print(pred_data_4)
    Year  Yield  Area_Harvested  Market_Value
24  2024    8.1     6800.000000    486.162202
25  2025    7.6     7200.000000    492.329445
26  2026    7.7     7132.197788    492.188767
27  2027    7.7     7221.251348    500.166677
28  2028    7.8     7244.102280    508.146956
Using the Linear Regression model, the predicted market value shows a steady increase from $486.16 million in 2024 to $508.15 million in 2028. Yield remains stable between 7.6 and 8.1 tons/ha, while the area harvested increases slightly from 6800 to 7244 hectares. This indicates that the model forecasts a gradual rise in market value despite relatively consistent yield and harvested area.
Given these results, Result 3 with XGBoost stands out as the best choice for forecasting the future 5 years because it has the lowest errors and the highest positive R² score, indicating a good fit and better predictive capability.
In Result 3, we used random sampling with feature combination 1 ([['Yield', 'CPI']]) to conduct the regression analysis.
Result 3 with XGBoost:
This model and result combination demonstrates the best performance and should be used for making forecasts for the next 5 years.
Note: The visualization of the results was performed exclusively for this model due to its relative best performance.
Autocorrelation: Autocorrelation, also known as serial correlation, measures the correlation of a time series with its own past values. It is a crucial aspect in time series analysis and forecasting because it helps identify patterns, trends, and seasonality in the data.
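As a minimal illustration (on a hypothetical toy series, independent of the market data below), lag-k autocorrelation is simply the correlation between a series and a copy of itself shifted by k steps:

```python
import numpy as np

def autocorrelation(series, lag):
    # Pearson correlation between the series and its copy shifted by `lag` steps
    x = np.asarray(series, dtype=float)
    return np.corrcoef(x[:-lag], x[lag:])[0, 1]

# A perfectly alternating series flips sign every step, so its
# lag-1 autocorrelation is -1 and its lag-2 autocorrelation is +1
alternating = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
```

A slowly decaying ACF suggests trend (differencing helps), while regular spikes at fixed lags suggest seasonality; both patterns guide model choice.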
Importance in Forecasting:
from statsmodels.graphics.tsaplots import plot_acf
from scipy.stats import boxcox
# Identify and replace the outlier (assuming the spike is at index 22 with value 900)
outlier_index = 22
neighbors_indices = [outlier_index - 1, outlier_index + 1]
# Calculate the median of the neighboring points
neighbors_values = filtered_data.loc[neighbors_indices, 'Market_Value_New']
median_value = neighbors_values.median()
# Replace the outlier with the median value
filtered_data.at[outlier_index, 'Market_Value_New'] = median_value
# Make the data stationary using Box-Cox transformation and differencing
filtered_data['Market_Value_boxcox'], lam = boxcox(filtered_data['Market_Value_New'])
# First difference of the transformed series (the first row becomes NaN)
filtered_data['Market_Value_stationary'] = filtered_data['Market_Value_boxcox'].diff()
# Plot autocorrelation
plt.rc("figure", figsize=(11, 5))
plot_acf(filtered_data['Market_Value_stationary'].dropna())
plt.xlabel('Lags', fontsize=18)
plt.ylabel('Correlation', fontsize=18)
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)
plt.title('Autocorrelation Plot', fontsize=20)
plt.tight_layout()
plt.show()
Interpreting the Graph: The graph above is an Autocorrelation Function (ACF) plot, showing the correlation of the stationarized series with its own lags.
Conclusion:
Understanding and interpreting autocorrelation is essential for selecting the correct forecasting model and improving the accuracy of predictions.
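The Box-Cox step used above is invertible, which matters because any forecast made on the transformed, differenced scale must be mapped back to market values. A pure-NumPy sketch of the transform and its inverse (the λ below is an arbitrary illustrative value, not the λ fitted by `scipy.stats.boxcox` above):

```python
import numpy as np

def boxcox_transform(y, lam):
    # Box-Cox: (y**lam - 1)/lam for lam != 0, log(y) for lam == 0 (requires y > 0)
    y = np.asarray(y, dtype=float)
    return np.log(y) if lam == 0 else (y ** lam - 1.0) / lam

def boxcox_inverse(z, lam):
    # Exact inverse, used to map transformed forecasts back to the original scale
    z = np.asarray(z, dtype=float)
    return np.exp(z) if lam == 0 else (lam * z + 1.0) ** (1.0 / lam)

# Round trip on a few positive market-value-like numbers (illustrative only)
y = np.array([103.3, 114.2, 168.5, 625.0, 682.8])
z = boxcox_transform(y, 0.3)
```

The transform stabilises variance (larger values are compressed more), while the subsequent differencing removes trend.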
Naive Model:
The naive forecast model is a simple time series forecasting method that assumes the value at the next time step is equal to the value at the current time step plus the change observed in the most recent time step. It is particularly useful for data with a strong trend or seasonal patterns.
Why Choose the Naive Model?
Mathematical Formula
The naive forecast model can be expressed mathematically as follows:

$$\hat{y}_{t+1} = y_t + (y_t - y_{t-1})$$

Where:
- $\hat{y}_{t+1}$ is the forecast for the next period,
- $y_t$ is the observed value at the current period,
- $y_{t-1}$ is the observed value at the previous period.

This formula reflects the idea that the forecast for the next period is the current value plus the most recent change observed in the time series.
This function iterates over the dataset, applying the naive forecasting formula to generate forecasted values. The first forecast is undefined because it requires a previous time step to calculate the change.
# Implement the naive model
def naive_forecast(df, column_name):
    forecasted_values = [None]  # The first forecast is undefined
    for t in range(1, len(df)):
        y_t = df[column_name].iloc[t]
        y_t_minus_1 = df[column_name].iloc[t - 1]
        y_hat_t_plus_1 = y_t + (y_t - y_t_minus_1)
        forecasted_values.append(y_hat_t_plus_1)
    return forecasted_values
# Apply the naive forecast
filtered_data['Market_Value_Forecast'] = naive_forecast(filtered_data, 'Market_Value_New')
# Keep the first forecasted value the same as the actual value for the first year
filtered_data.loc[0, 'Market_Value_Forecast'] = filtered_data.loc[0, 'Market_Value_New']
# Calculate evaluation metrics for the period where predictions are available
filtered_data = filtered_data.dropna(subset=['Market_Value_Forecast'])
mae = mean_absolute_error(filtered_data['Market_Value_New'][1:], filtered_data['Market_Value_Forecast'][1:])
mse = mean_squared_error(filtered_data['Market_Value_New'][1:], filtered_data['Market_Value_Forecast'][1:])
rmse = np.sqrt(mse)
print(f'Mean Absolute Error (MAE): {mae}')
print(f'Mean Squared Error (MSE): {mse}')
print(f'Root Mean Squared Error (RMSE): {rmse}')
# Extend the forecast for the next 5 years
last_value = filtered_data['Market_Value_New'].iloc[-1]
last_forecasted_value = filtered_data['Market_Value_Forecast'].iloc[-1]
forecasted_values = [last_forecasted_value]
for i in range(5):
    next_forecast = forecasted_values[-1] + (forecasted_values[-1] - last_value)
    forecasted_values.append(next_forecast)
    last_value = forecasted_values[-2]
# Create a DataFrame for the extended forecast
forecast_years = range(filtered_data['Year'].max() + 1, filtered_data['Year'].max() + 6)
forecast_df = pd.DataFrame({'Year': forecast_years, 'Market_Value_Forecast': forecasted_values[1:]})
# Calculate the confidence interval
confidence_interval = 1.96 * rmse
forecast_df['Lower_CI'] = forecast_df['Market_Value_Forecast'] - confidence_interval
forecast_df['Upper_CI'] = forecast_df['Market_Value_Forecast'] + confidence_interval
# Output the forecasted values
print(forecast_df)
# Plot the actual and forecasted values along with the confidence interval
plt.figure(figsize=(10, 6))
plt.plot(filtered_data['Year'], filtered_data['Market_Value_New'], label='Actual Market Value', marker='o')
plt.plot(filtered_data['Year'], filtered_data['Market_Value_Forecast'], label='Forecasted Market Value', marker='x', linestyle='--')
plt.plot(forecast_df['Year'], forecast_df['Market_Value_Forecast'], label='Extended Forecast', marker='x', linestyle='--')
plt.scatter(forecast_df['Year'], forecast_df['Lower_CI'], color='blue', marker='*', label='Lower PI')
plt.scatter(forecast_df['Year'], forecast_df['Upper_CI'], color='blue', marker='*', label='Upper PI')
plt.fill_between(forecast_df['Year'], forecast_df['Lower_CI'], forecast_df['Upper_CI'], color='gray', alpha=0.2, label='95% Prediction Interval')
plt.xlabel('Year')
plt.ylabel('Market Value')
plt.title('Actual vs Forecasted Market Value with Confidence Interval')
plt.legend()
plt.grid(True)
plt.show()
Mean Absolute Error (MAE): 41.3816632015416
Mean Squared Error (MSE): 2774.171766262752
Root Mean Squared Error (RMSE): 52.670406930863486

   Year  Market_Value_Forecast    Lower_CI    Upper_CI
0  2024             740.677668  637.443670  843.911665
1  2025             769.597084  666.363087  872.831082
2  2026             798.516501  695.282504  901.750499
3  2027             827.435918  724.201921  930.669916
4  2028             856.355335  753.121338  959.589333
Forecast Results: The naive forecast model predicts the following market values for the next five years, with associated 95% confidence intervals:
| Year | Market_Value_Forecast | Lower_PI | Upper_PI |
|---|---|---|---|
| 2024 | 740.68 | 637.44 | 843.91 |
| 2025 | 769.60 | 666.36 | 872.83 |
| 2026 | 798.52 | 695.28 | 901.75 |
| 2027 | 827.44 | 724.20 | 930.67 |
| 2028 | 856.36 | 753.12 | 959.59 |
Model Evaluation Metrics for Naive Forecast Model:
- Mean Absolute Error (MAE): 41.38
- Mean Squared Error (MSE): 2774.17
- Root Mean Squared Error (RMSE): 52.67

These metrics indicate the average magnitude of errors in the forecast. MAE provides the average error in absolute terms, MSE emphasizes larger errors by squaring them, and RMSE offers a measure of the standard deviation of the residuals.
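The claim that MSE emphasizes larger errors can be seen on two illustrative (hypothetical) error vectors that share the same MAE:

```python
import numpy as np

errors_even  = np.array([10.0, 10.0, 10.0, 10.0])  # four moderate misses
errors_spiky = np.array([1.0, 1.0, 1.0, 37.0])     # one large miss, same absolute total

mae_even = np.mean(np.abs(errors_even))    # 10.0
mae_spiky = np.mean(np.abs(errors_spiky))  # 10.0 -- identical MAE
mse_even = np.mean(errors_even ** 2)       # 100.0
mse_spiky = np.mean(errors_spiky ** 2)     # 343.0 -- squaring punishes the spike
```

The two vectors are indistinguishable under MAE, but MSE (and hence RMSE) flags the forecast with the single large miss as much worse.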
Interpretation: The forecasted values show a steady increase in market value over the next five years. Note that the prediction interval is a constant-width band of ±1.96 × RMSE ≈ ±103.23 around each forecast; it does not widen with the horizon, and therefore understates the extra uncertainty of long-range predictions. The naive forecast model relies heavily on the most recent observations, assuming the same rate of change continues into the future.
Conclusion: The model's performance metrics suggest moderate accuracy, with the errors in forecasted values being relatively consistent. The increasing trend in forecasted market values aligns with historical data patterns, but because the fixed-width intervals cannot capture the growing uncertainty of longer horizons, long-term forecasts should be interpreted cautiously and complemented with additional models for more robust predictions.
from statsmodels.tsa.holtwinters import Holt
import plotly.graph_objects as go
# Split train and test
train = filtered_data.iloc[:-int(len(filtered_data) * 0.2)]
test = filtered_data.iloc[-int(len(filtered_data) * 0.2):]
def plot_func(forecast1: list[float], title: str) -> None:
    """Function to plot the forecasts."""
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=train['Year'], y=train['Market_Value_New'], name='Train'))
    fig.add_trace(go.Scatter(x=test['Year'], y=test['Market_Value_New'], name='Test'))
    fig.add_trace(go.Scatter(x=test['Year'], y=forecast1, name="Holt's Linear"))
    fig.update_layout(template="simple_white", font=dict(size=18), title_text=title,
                      width=700, title_x=0.5, height=400, xaxis_title='Year',
                      yaxis_title='Market Value (USD)')
    return fig.show()
# Fit Holt's model and get forecasts
model_holt = Holt(train['Market_Value_New'], damped_trend=True).fit(optimized=True)
forecasts_holt = model_holt.forecast(len(test))
# Calculate and print optimized parameters
alpha_holt = model_holt.params['smoothing_level']
beta_holt = model_holt.params.get('smoothing_trend', 'Not available')
print(f'Holt - Optimized alpha: {alpha_holt}')
print(f'Holt - Optimized beta: {beta_holt}')
# Plot the forecasts
plot_func(forecasts_holt, "Holt's Exponential Smoothing")
Holt - Optimized alpha: 0.9999999850988388
Holt - Optimized beta: 0.0
Holt's Linear Trend Model: Holt's method extends simple exponential smoothing by adding a trend component (hence the name "double exponential smoothing"), using two parameters:
Why Holt's Method:
However, we discard this method here: the optimized parameters (alpha ≈ 1, beta = 0) collapse the model to tracking the last observation with a fixed, damped trend, which adds little beyond the naive forecast.
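For completeness, Holt's recursions can be sketched from scratch. This is a simplified, undamped version of what `statsmodels` fits above, with level and trend initialised from the first two observations (an assumption made purely for illustration):

```python
def holt_forecast(y, alpha, beta, h):
    """Undamped Holt's linear trend: return the h-step-ahead forecast."""
    level, trend = y[0], y[1] - y[0]   # simple initialisation from the first two points
    for t in range(1, len(y)):
        prev_level = level
        # Smoothed level: blend the new observation with the previous level + trend
        level = alpha * y[t] + (1 - alpha) * (level + trend)
        # Smoothed trend: blend the latest level change with the previous trend
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return level + h * trend           # linear extrapolation h steps ahead

# On a perfectly linear series the forecast is exact for any alpha, beta
y = [2.0, 4.0, 6.0, 8.0, 10.0]
```

On trending but noisy data, alpha and beta trade responsiveness against smoothness; the degenerate fit above (alpha ≈ 1, beta = 0) shows the optimiser found no stable trend signal to smooth.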
from statsmodels.tsa.ar_model import AutoReg
# Fit an AR(1) model to the data
model_ar1 = AutoReg(filtered_data['Market_Value'], lags=1).fit()
# Print the summary of the AR(1) model
print(model_ar1.summary())
# Forecast the next 5 years
forecast_periods = 5
forecast_years = np.arange(filtered_data['Year'].max() + 1, filtered_data['Year'].max() + 6)
forecasts_ar1 = model_ar1.predict(start=len(filtered_data), end=len(filtered_data) + forecast_periods - 1, dynamic=False)
# Calculate the confidence intervals (assuming normal distribution of residuals)
forecast_std = np.std(model_ar1.resid)
confidence_interval = 1.96 * forecast_std
lower_ci = forecasts_ar1 - confidence_interval
upper_ci = forecasts_ar1 + confidence_interval
# Create a DataFrame for the forecast and prediction intervals
forecast_df = pd.DataFrame({
'Year': forecast_years,
'Forecast': forecasts_ar1,
'Lower_CI': lower_ci,
'Upper_CI': upper_ci
})
# Print the forecasted values and prediction intervals
print(forecast_df)
# Function to plot the forecasts
def plot_forecast_with_intervals(train, forecast_df, title):
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=train['Year'], y=train['Market_Value'], name='Actual'))
    fig.add_trace(go.Scatter(x=forecast_df['Year'], y=forecast_df['Forecast'], name='Forecast'))
    fig.add_trace(go.Scatter(x=forecast_df['Year'], y=forecast_df['Lower_CI'], name='Lower PI', mode='lines', line=dict(dash='dash')))
    fig.add_trace(go.Scatter(x=forecast_df['Year'], y=forecast_df['Upper_CI'], name='Upper PI', mode='lines', line=dict(dash='dash')))
    fig.update_layout(template="simple_white", font=dict(size=18), title_text=title,
                      width=700, title_x=0.5, height=400, xaxis_title='Year',
                      yaxis_title='Market Value (USD)')
    return fig.show()
# Plot the forecast with prediction intervals
plot_forecast_with_intervals(filtered_data, forecast_df, "AR(1) Model Forecast with Prediction Intervals")
AutoReg Model Results
==============================================================================
Dep. Variable: Market_Value No. Observations: 24
Model: AutoReg(1) Log Likelihood -136.560
Method: Conditional MLE S.D. of innovations 91.695
Date: Fri, 12 Jul 2024 AIC 279.121
Time: 12:05:06 BIC 282.527
Sample: 1 HQIC 279.978
24
===================================================================================
coef std err z P>|z| [0.025 0.975]
-----------------------------------------------------------------------------------
const 43.5676 33.123 1.315 0.188 -21.353 108.488
Market_Value.L1 0.9445 0.082 11.551 0.000 0.784 1.105
Roots
=============================================================================
Real Imaginary Modulus Frequency
-----------------------------------------------------------------------------
AR.1 1.0588 +0.0000j 1.0588 0.0000
-----------------------------------------------------------------------------
Year Forecast Lower_CI Upper_CI
24 2024 688.485188 508.762559 868.207816
25 2025 693.817973 514.095345 873.540602
26 2026 698.854604 519.131976 878.577233
27 2027 703.611528 523.888900 883.334157
28 2028 708.104278 528.381649 887.826906
Model Summary: The AR(1) model results indicate that the market value at time (t) is primarily influenced by its value at time (t-1), with an added constant term.
Key Statistics:
- 24 observations; log likelihood -136.560; AIC 279.121, BIC 282.527, HQIC 279.978.

Coefficients:
- const = 43.5676 (std err 33.123, p = 0.188)
- Market_Value.L1 = 0.9445 (std err 0.082, p < 0.001), indicating strong year-to-year persistence.

Roots of the Characteristic Polynomial:
- A single real root at 1.0588. Its modulus is only slightly above 1, so the fitted process is stationary but close to a unit root, which explains the slow, trend-like forecasts.
The forecasts and their 95% prediction intervals for the next five years are as follows:
| Year | Forecast | Lower_PI | Upper_PI |
|---|---|---|---|
| 2024 | 688.485188 | 508.762559 | 868.207816 |
| 2025 | 693.817973 | 514.095345 | 873.540602 |
| 2026 | 698.854604 | 519.131976 | 878.577233 |
| 2027 | 703.611528 | 523.888900 | 883.334157 |
| 2028 | 708.104278 | 528.381649 | 887.826906 |
Interpretation:
The AR(1) model indicates a strong dependence on the previous year's market value, suggesting a continuation of the existing trend. The forecasts show a steady increase in market value with increasing uncertainty over time. This model can be useful for short-term forecasting but should be complemented with other models or analyses for long-term predictions due to increasing uncertainty.
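The printed forecasts follow mechanically from the fitted equation ŷ(t+1) = 43.5676 + 0.9445·y(t), iterated from the 2023 value. Using the rounded coefficients from the summary reproduces the table to within rounding error, and the implied long-run mean const / (1 − φ) ≈ 785 shows where the forecasts are converging:

```python
const, phi = 43.5676, 0.9445   # rounded coefficients from the AutoReg summary
y_2023 = 682.838834            # last observed market value (2023)

# Iterate the AR(1) recursion to produce the 2024-2028 forecasts
forecasts = []
y = y_2023
for _ in range(5):
    y = const + phi * y
    forecasts.append(y)

# Long-run mean of a stationary AR(1): the level the forecasts approach
long_run_mean = const / (1 - phi)   # roughly 785
```

Because φ is close to (but below) 1, each step moves only a small fraction of the way toward the long-run mean, which is why the forecast path rises so gradually.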
After comparing various machine learning regression models and traditional forecasting methods, it has been determined that the XGBoost machine learning model yields the best results for this dataset. This advantage arises because the traditional forecasting models consider only the market value series itself, whereas the regression models account for the impact of multiple explanatory variables on market value. Consequently, the regression approach produces more accurate forecasts by incorporating these additional factors.